In this short letter we present the construction of a bi-stochastic kernel $p$ for an arbitrary data set
$X$ that is derived from an asymmetric affinity function $a$. The affinity function $a$ measures the
similarity between points in $X$ and some reference set $Y$. Unlike other methods that construct
bi-stochastic kernels via some convergent iteration process or through solving an optimization
problem, the construction presented here is quite simple. Furthermore, it can be viewed through
the lens of out of sample extensions, making it useful for massive data sets.
1. Introduction

Given a positive, symmetric kernel (matrix) $k$, the question of how to construct a bi-stochastic
kernel derived from $k$ has been of interest in certain applications such as data clustering. Various
algorithms for this task exist. One of the best known is the Sinkhorn-Knopp algorithm [1], in
which one alternately normalizes the rows and columns of $k$ to sum to one. A symmetrization of
this algorithm is given in [2], and is subsequently used to cluster data. In both cases, the process
converges to a bi-stochastic matrix only in the limit of infinitely many iterations. In another
application of data clustering, the authors in [3] solve a quadratic programming problem to obtain
what they call the Bregmanian bi-stochastication of $k$. Common to these algorithms and others
is the complexity in solving for (or approximating) the bi-stochastic matrix.
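
For comparison with the construction presented later, the alternating row and column normalization described above can be sketched as follows. This is a minimal illustration written with NumPy, not the implementation of [1]; the function name `sinkhorn_knopp`, the iteration cap, and the tolerance are assumptions made for the example.

```python
import numpy as np

def sinkhorn_knopp(k, num_iters=500, tol=1e-10):
    """Alternately rescale the rows and columns of a positive matrix k."""
    k = np.asarray(k, dtype=float)
    r = np.ones(k.shape[0])  # row scaling factors
    c = np.ones(k.shape[1])  # column scaling factors
    for _ in range(num_iters):
        r = 1.0 / (k @ c)      # rows of diag(r) k diag(c) now sum to one
        c = 1.0 / (k.T @ r)    # columns of diag(r) k diag(c) now sum to one
        scaled = r[:, None] * k * c[None, :]
        # Stop once the row sums are also within tolerance of one.
        if np.abs(scaled.sum(axis=1) - 1.0).max() < tol:
            break
    return r[:, None] * k * c[None, :]

# Hypothetical test kernel: positive and symmetric.
rng = np.random.default_rng(0)
k = rng.uniform(0.5, 1.5, size=(6, 6))
k = 0.5 * (k + k.T)
b = sinkhorn_knopp(k)
print(b.sum(axis=0), b.sum(axis=1))
```

The loop stops only when the row sums are within a tolerance of one, which reflects the fact that the iteration reaches an exactly bi-stochastic matrix only in the limit.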

Also related to the goal of organizing data, over the last decade we have seen the development
of a class of research that utilizes nonlinear mappings into low dimensional spaces in order
to organize potentially high dimensional data. Examples include locally linear embedding (LLE)
[4], ISOMAP [5], Hessian LLE [6], Laplacian eigenmaps [7], and diffusion maps [8]. In many
applications, these data sets are not only high dimensional, but massive. Thus there has been the
need to develop complementary methods by which these nonlinear mappings can be computed
efficiently. The Nyström extension is one early such example; in [9] several out of sample
extensions are given for various nonlinear mappings, while [10] utilizes geometric harmonics to
extend empirical functions.

In this letter we present an extremely simple bi-stochastic kernel construction that can also
be implemented to handle massive data sets. Let $X$ be the data set. The entire construction is
derived not from a kernel on $X$, but rather an asymmetric affinity function $a : X \times Y \to \mathbb{R}$
between the given data and some reference set $Y$. The key to realizing the bi-stochastic nature
of the derived kernel is to apply the correct weighted measure on $X$. The eigenfunctions (or
eigenvectors) of this bi-stochastic kernel on $X$ are also easily computable via a Nyström type
extension of the eigenvectors of a related kernel on $Y$. Since the reference set can usually be
taken to be much smaller than the original data set, these eigenvectors are simple to compute.

2. A simple bi-stochastic kernel construction 

We take our data set to be a measure space $(X, \mu)$, in which $\mu$ represents the distribution
of the points. We also assume that we are given, or able to compute, a finite reference set
$Y = \{y_1, \ldots, y_n\}$. Note that one can take $X$ to be discrete or finite as well; in particular, one
special case is when $X = Y$ and $\mu = \frac{1}{n} \sum_{i=1}^{n} \delta_{y_i}$.

2.1. Affinity functions and densities 

Let $a : X \times Y \to \mathbb{R}$ be a positive affinity function that measures the similarity between the
data set $X$ and the reference set $Y$. Larger values of $a(x, y_i)$ indicate that the two data points are
very similar, while values closer to zero imply that $x$ and $y_i$ are quite different.

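We derive two densities from the affinity function $a$, which we shall then use to normalize it.
The first of these is the density $\Omega : X \to \mathbb{R}$ on the data set $X$; we take it as

$$\Omega(x) \triangleq \sum_{i=1}^{n} a(x, y_i), \quad \text{for all } x \in X.$$

We also have a density $\omega : Y \to \mathbb{R}$ on the reference set, which we define as:

$$\omega(y_i) \triangleq \left( \int_X a(x, y_i) \, \Omega(x) \, d\mu(x) \right)^{1/2}, \quad \text{for all } y_i \in Y.$$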



Assumption 1. We make the following simple assumptions concerning $a$:

1. For each $y_i \in Y$, the function $a(\cdot, y_i) : X \to \mathbb{R}$ is square integrable, i.e.,

$$a(\cdot, y_i) \in L^2(X, \mu). \tag{1}$$

2. The densities $\Omega$ and $\omega$ are finite and strictly positive:

$$0 < \Omega(x) < \infty, \quad \text{for all } x \in X, \tag{2}$$
$$0 < \omega(y_i) < \infty, \quad \text{for all } y_i \in Y. \tag{3}$$

The $L^2$ integrability condition is necessary to make sure that functions and operators related
to $X$ make sense. The upper bounds on the densities simply put a finite limit on how close any
point $x$ or $y_i$ is to either $Y$ or $X$, respectively. Meanwhile, the lower bounds state that each point
in $X$ has some relation to the reference set $Y$, and likewise that each reference point $y_i$ has some
similarity to at least part of $X$.

Using $a$ and the two densities $\Omega$ and $\omega$, we define a normalized affinity $\beta : X \times Y \to \mathbb{R}$ as

$$\beta(x, y_i) \triangleq \frac{a(x, y_i)}{\Omega(x) \, \omega(y_i)}, \quad \text{for all } (x, y_i) \in X \times Y.$$
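
For a finite data set the densities and the normalized affinity reduce to a few array operations. The sketch below is illustrative only. It assumes $\Omega(x) = \sum_i a(x, y_i)$ and $\omega(y_i)^2 = \int_X a(x, y_i)\,\Omega(x)\,d\mu(x)$, one choice of densities consistent with the bi-stochastic property proved in Section 2.2, and it uses a Gaussian affinity and uniform point masses as hypothetical test data. The function name `normalized_affinity` is likewise an assumption.

```python
import numpy as np

def normalized_affinity(a, mu):
    """Return (Omega, omega, beta) for a discrete data set.

    a  : (N, n) array of positive affinities a(x_j, y_i)
    mu : (N,) array of point masses representing the measure mu on X
    """
    a = np.asarray(a, dtype=float)
    mu = np.asarray(mu, dtype=float)
    Omega = a.sum(axis=1)                # assumed: Omega(x) = sum_i a(x, y_i)
    omega = np.sqrt(a.T @ (Omega * mu))  # assumed: omega(y_i)^2 = int a(x, y_i) Omega(x) dmu(x)
    beta = a / np.outer(Omega, omega)    # beta(x, y_i) = a(x, y_i) / (Omega(x) omega(y_i))
    return Omega, omega, beta

# Hypothetical example: Gaussian affinity between 200 points and 10 reference points.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Y = X[rng.choice(200, size=10, replace=False)]
sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
a = np.exp(-sq_dists / 2.0)
mu = np.full(200, 1.0 / 200)

Omega, omega, beta = normalized_affinity(a, mu)
```

Only the $N \times n$ array of affinities is ever formed, which is what makes the construction practical when the reference set is much smaller than the data set.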

From this point forward we will use the weighted measure $\Omega^2 \mu$ on $X$. This measure is the
"correct" measure in the sense that it is the measure for which we can define a bi-stochastic
kernel on $X$ in a natural, simple way. Using Assumption 1 one can easily show that for each
$y_i \in Y$, the function $\beta(\cdot, y_i) : X \to \mathbb{R}$ is well behaved under this measure:

$$\beta(\cdot, y_i) \in L^2(X, \Omega^2 \mu), \quad \text{for all } y_i \in Y.$$
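
The verification is a one-line computation. Reading the normalized affinity as $\beta(x, y_i) = a(x, y_i) / (\Omega(x)\,\omega(y_i))$, the factor $\Omega(x)^2$ in the weighted measure cancels the factor $\Omega(x)^{-2}$ in $\beta(x, y_i)^2$, so that conditions (1) and (3) of Assumption 1 suffice:

```latex
\int_X \beta(x, y_i)^2 \, \Omega(x)^2 \, d\mu(x)
  = \frac{1}{\omega(y_i)^2} \int_X a(x, y_i)^2 \, d\mu(x)
  = \frac{\| a(\cdot, y_i) \|_{L^2(X, \mu)}^2}{\omega(y_i)^2}
  < \infty .
```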

We note that affinity functions similar to $\beta$ were first considered in [11] in the context of out
of sample extensions for independent components analysis (ICA). They have also been utilized in
[12] in the context of filtering. The connection to bi-stochastic kernels, though, has until now
gone unnoticed.

2.2. The bi-stochastic kernel 

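To construct the bi-stochastic kernel on $X$ we utilize the affinity function $\beta$. Let $p : X \times X \to \mathbb{R}$
denote the kernel, and define it as:

$$p(x, x') \triangleq \langle \beta(x, \cdot), \beta(x', \cdot) \rangle_{\mathbb{R}^n}
  = \sum_{i=1}^{n} \beta(x, y_i) \, \beta(x', y_i), \quad \text{for all } (x, x') \in X \times X.$$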

The following proposition summarizes the main properties of $p$.

Proposition 2. If Assumption 1 holds, then:

1. The kernel $p$ is square integrable under the weighted measure $\Omega^2 \mu$, i.e.,

$$p \in L^2(X \times X, \Omega^2 \mu \otimes \Omega^2 \mu).$$

2. The kernel $p$ is bi-stochastic under the weighted measure $\Omega^2 \mu$, i.e.,

$$\int_X p(x, x') \, \Omega(x')^2 \, d\mu(x') = \int_X p(x', x) \, \Omega(x')^2 \, d\mu(x') = 1, \quad \text{for all } x \in X.$$

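Proof. We begin with the $L^2$ integrability condition. Using the definition of $p$, expanding its $L^2$
norm, and applying Hölder's theorem gives:

$$\| p \|^2_{L^2(X \times X, \Omega^2 \mu \otimes \Omega^2 \mu)}
  = \int_X \int_X \sum_{i,j=1}^{n} \beta(x, y_i) \beta(x', y_i) \beta(x, y_j) \beta(x', y_j) \, \Omega(x')^2 \, \Omega(x)^2 \, d\mu(x) \, d\mu(x')$$

$$\leq \sum_{i,j=1}^{n} \| \beta(\cdot, y_i) \|^2_{L^2(X, \Omega^2 \mu)} \, \| \beta(\cdot, y_j) \|^2_{L^2(X, \Omega^2 \mu)}$$

$$< \infty,$$

where the last bound is finite because each $\beta(\cdot, y_i)$ lies in $L^2(X, \Omega^2 \mu)$.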
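Now we show that $p$ is bi-stochastic:

$$\int_X p(x, x') \, \Omega(x')^2 \, d\mu(x')
  = \int_X \sum_{i=1}^{n} \frac{a(x, y_i)}{\Omega(x)} \, \frac{a(x', y_i)}{\Omega(x')} \, \frac{1}{\omega(y_i)^2} \, \Omega(x')^2 \, d\mu(x')$$

$$= \sum_{i=1}^{n} \frac{a(x, y_i)}{\Omega(x) \, \omega(y_i)^2} \left[ \int_X a(x', y_i) \, \Omega(x') \, d\mu(x') \right]$$

$$= \sum_{i=1}^{n} \frac{a(x, y_i)}{\Omega(x)} = 1.$$

Here the bracketed integral equals $\omega(y_i)^2$ and $\sum_{i=1}^{n} a(x, y_i) = \Omega(x)$, by the definitions of the
two densities. Since $p$ is clearly symmetric, this completes the proof. □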
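
The bi-stochastic identity can also be checked numerically on a finite data set. The following sketch is an illustration rather than part of the proof. The random affinities and uniform point masses are hypothetical test data, and the densities are taken as $\Omega(x) = \sum_i a(x, y_i)$ and $\omega(y_i)^2 = \int_X a(x, y_i)\,\Omega(x)\,d\mu(x)$, which is an assumption of the example.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 300, 12
a = rng.uniform(0.1, 1.0, size=(N, n))  # hypothetical positive affinities a(x_j, y_i)
mu = np.full(N, 1.0 / N)                # uniform point masses on X

Omega = a.sum(axis=1)                   # assumed density on X
omega = np.sqrt(a.T @ (Omega * mu))     # assumed density on Y
beta = a / np.outer(Omega, omega)       # normalized affinity

p = beta @ beta.T                       # p(x, x') = sum_i beta(x, y_i) beta(x', y_i)
w = Omega**2 * mu                       # point masses of the weighted measure Omega^2 mu

# Discrete version of int_X p(x, x') Omega(x')^2 dmu(x') for every x.
row_integrals = p @ w
print(row_integrals.min(), row_integrals.max())
```

Because $p$ is symmetric, the same vector also gives the integrals over the first argument, so both marginals are checked at once.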

Since $p \in L^2(X \times X, \Omega^2 \mu \otimes \Omega^2 \mu)$, one can define the integral operator
$P : L^2(X, \Omega^2 \mu) \to L^2(X, \Omega^2 \mu)$ as:

$$(Pf)(x) \triangleq \int_X p(x, x') \, f(x') \, \Omega(x')^2 \, d\mu(x'), \quad \text{for all } f \in L^2(X, \Omega^2 \mu).$$

Given the results of Proposition 2, we see that $P$ is a Hilbert-Schmidt, self-adjoint, diffusion
operator. In terms of data organization and clustering, it is usually the eigenfunctions and eigenvalues
of $P$ that are of interest; see, for example, [8]. The fact that $P$ is bi-stochastic though, as opposed
to merely row stochastic, could make it particularly interesting for these types of applications in
which $I - P$ is used as an approximation of the Laplacian.
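
On a finite data set the operator $P$ becomes a matrix, and its basic properties can be inspected directly. The sketch below is a hedged illustration under the same assumptions as before: hypothetical random affinities, uniform point masses, and densities $\Omega(x) = \sum_i a(x, y_i)$, $\omega(y_i)^2 = \int_X a(x, y_i)\,\Omega(x)\,d\mu(x)$. It checks self-adjointness in the weighted inner product and computes the spectrum through a symmetric matrix similar to $P$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 150, 8
a = rng.uniform(0.1, 1.0, size=(N, n))  # hypothetical positive affinities
mu = np.full(N, 1.0 / N)                # uniform point masses on X

Omega = a.sum(axis=1)                   # assumed density on X
omega = np.sqrt(a.T @ (Omega * mu))     # assumed density on Y
beta = a / np.outer(Omega, omega)
w = Omega**2 * mu                       # point masses of the measure Omega^2 mu

p = beta @ beta.T                       # kernel p(x, x')
P = p * w[None, :]                      # matrix of f -> int p(x, x') f(x') Omega(x')^2 dmu(x')

# Self-adjointness in L^2(X, Omega^2 mu): <Pf, g>_w equals <f, Pg>_w.
f, g = rng.normal(size=N), rng.normal(size=N)
lhs = np.sum((P @ f) * g * w)
rhs = np.sum(f * (P @ g) * w)

# P is similar to the symmetric matrix S = W^{1/2} p W^{1/2}, so its spectrum is real.
sqrt_w = np.sqrt(w)
S = sqrt_w[:, None] * p * sqrt_w[None, :]
evals = np.linalg.eigvalsh(S)           # eigenvalues in ascending order

laplacian = np.eye(N) - P               # discrete stand-in for I - P
```

In this discrete setting the eigenvalues lie in $[0, 1]$, the largest equals one, and constant functions are annihilated by $I - P$.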

2.3. A Nyström type extension

The affinity $\beta$ can also be used to construct an $n \times n$ matrix $A$ as follows:

$$A[i, j] \triangleq \int_X \beta(x, y_i) \, \beta(x, y_j) \, \Omega(x)^2 \, d\mu(x), \quad \text{for all } i, j = 1, \ldots, n.$$

The matrix $A$ is useful for computing the eigenfunctions and eigenvalues of $P$. The following
proposition is simply an interpretation of the singular value decomposition (SVD) in this context.

Proposition 3. If Assumption 1 holds, then:

1. Let $\lambda \in \mathbb{R} \setminus \{0\}$. Then $\lambda$ is an eigenvalue of $P$ if and only if it is an eigenvalue of $A$.

2. Let $\lambda \in \mathbb{R} \setminus \{0\}$. If $\psi \in L^2(X, \Omega^2 \mu)$ is an eigenfunction of $P$ with eigenvalue $\lambda$ and $v \in \mathbb{R}^n$ is
the corresponding eigenvector of $A$, then:

$$\psi(x) = \frac{1}{\sqrt{\lambda}} \sum_{i=1}^{n} \beta(x, y_i) \, v[i],$$

$$v[i] = \frac{1}{\sqrt{\lambda}} \int_X \beta(x, y_i) \, \psi(x) \, \Omega(x)^2 \, d\mu(x).$$
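
The extension in Proposition 3 translates into a short computation: diagonalize the small $n \times n$ matrix $A$ and lift its eigenvectors to $X$ through $\beta$. The sketch below is a minimal illustration, assuming hypothetical random affinities, uniform point masses, and the densities $\Omega(x) = \sum_i a(x, y_i)$, $\omega(y_i)^2 = \int_X a(x, y_i)\,\Omega(x)\,d\mu(x)$. The number of retained eigenpairs `m` is an arbitrary choice made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
N, n = 400, 10
a = rng.uniform(0.1, 1.0, size=(N, n))  # hypothetical positive affinities
mu = np.full(N, 1.0 / N)                # uniform point masses on X

Omega = a.sum(axis=1)                   # assumed density on X
omega = np.sqrt(a.T @ (Omega * mu))     # assumed density on Y
beta = a / np.outer(Omega, omega)
w = Omega**2 * mu                       # point masses of the measure Omega^2 mu

# A[i, j] = int beta(x, y_i) beta(x, y_j) Omega(x)^2 dmu(x), an n x n matrix.
A = beta.T @ (w[:, None] * beta)
lam, V = np.linalg.eigh(A)              # ascending eigenvalues, orthonormal eigenvectors
lam, V = lam[::-1], V[:, ::-1]          # reorder to descending

m = 3                                   # number of eigenpairs kept (arbitrary)
psi = (beta @ V[:, :m]) / np.sqrt(lam[:m])  # psi = lambda^{-1/2} sum_i beta(., y_i) v[i]

# Apply P to psi without ever forming the N x N kernel p.
P_psi = beta @ (beta.T @ (w[:, None] * psi))

# Recover v from psi, and compute the Gram matrix of psi in L^2(X, Omega^2 mu).
v_back = (beta.T @ (w[:, None] * psi)) / np.sqrt(lam[:m])
gram = psi.T @ (w[:, None] * psi)
```

Evaluating an eigenfunction at a new point $x$ requires only the $n$ values $\beta(x, y_i)$, which is what makes the formula usable as an out of sample extension.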